
MatteViT: High-Frequency-Aware Document Shadow Removal with Shadow Matte Guidance

Kim, Chaewon, Lee, Seoyeon, Park, Jonghyuk

arXiv.org Artificial Intelligence

Document shadow removal is essential for enhancing the clarity of digitized documents. Preserving high-frequency details (e.g., text edges and lines) is critical in this process because shadows often obscure or distort fine structures. This paper proposes a matte vision transformer (MatteViT), a novel shadow removal framework that applies spatial and frequency-domain information to eliminate shadows while preserving fine-grained structural details. To effectively retain these details, we employ two preservation strategies. First, our method introduces a lightweight high-frequency amplification module (HFAM) that decomposes and adaptively amplifies high-frequency components. Second, we present a continuous luminance-based shadow matte, generated using a custom-built matte dataset and shadow matte generator, which provides precise spatial guidance from the earliest processing stage. These strategies enable the model to accurately identify fine-grained regions and restore them with high fidelity. Extensive experiments on public benchmarks (RDD and Kligler) demonstrate that MatteViT achieves state-of-the-art performance, providing a robust and practical solution for real-world document shadow removal. Furthermore, the proposed method better preserves text-level details in downstream tasks, such as optical character recognition, improving recognition performance over prior methods.
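
The abstract does not detail the HFAM's internals; as a rough, hypothetical illustration of frequency decomposition with adaptive amplification, the following PyTorch sketch splits an image into low- and high-frequency parts with a fixed blur and rescales the residual with a learnable factor. The module name HFAmplifier and the single scalar alpha are our own placeholders, not the paper's design.

    # A minimal sketch (not the paper's implementation) of high-frequency
    # amplification: split an image into low/high-frequency parts with a
    # fixed blur, then rescale the high-frequency residual.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class HFAmplifier(nn.Module):  # hypothetical name
        def __init__(self, channels: int = 3, kernel_size: int = 5):
            super().__init__()
            # Fixed box blur as a stand-in for the paper's decomposition.
            weight = torch.full((channels, 1, kernel_size, kernel_size),
                                1.0 / (kernel_size * kernel_size))
            self.register_buffer("blur_weight", weight)
            # Learnable amplification factor for the high-frequency band.
            self.alpha = nn.Parameter(torch.ones(1))
            self.channels = channels
            self.pad = kernel_size // 2

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            low = F.conv2d(x, self.blur_weight, padding=self.pad,
                           groups=self.channels)  # low-frequency estimate
            high = x - low                        # high-frequency residual
            return low + self.alpha * high        # adaptively amplified

    x = torch.rand(1, 3, 256, 256)
    print(HFAmplifier()(x).shape)  # torch.Size([1, 3, 256, 256])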


M3DR: Towards Universal Multilingual Multimodal Document Retrieval

Kolavi, Adithya S, Jain, Vyoman

arXiv.org Artificial Intelligence

Multimodal document retrieval systems have shown strong progress in aligning visual and textual content for semantic search. However, most existing approaches remain heavily English-centric, limiting their effectiveness in multilingual contexts. In this work, we present M3DR (Multilingual Multimodal Document Retrieval), a framework designed to bridge this gap across languages, enabling applicability across diverse linguistic and cultural contexts. M3DR leverages synthetic multilingual document data and generalizes across different vision-language architectures and model sizes, enabling robust cross-lingual and cross-modal alignment. Using contrastive training, our models learn unified representations for text and document images that transfer effectively across languages. We validate this capability on 22 typologically diverse languages, demonstrating consistent performance and adaptability across linguistic and script variations. We further introduce a comprehensive benchmark that captures real-world multilingual scenarios, evaluating models under monolingual, multilingual, and mixed-language settings. M3DR generalizes across both single dense vector and ColBERT-style token-level multi-vector retrieval paradigms. Our models, NetraEmbed and ColNetraEmbed, achieve state-of-the-art performance with ~150% relative improvements on cross-lingual retrieval.
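
The abstract does not state the exact form of the contrastive objective; a symmetric InfoNCE-style loss over paired text and document-image embeddings, as sketched below, is the standard choice for this kind of cross-modal alignment. All names and dimensions here are illustrative.

    # Minimal sketch of symmetric contrastive (InfoNCE-style) alignment
    # between text and document-image embeddings; illustrative only.
    import torch
    import torch.nn.functional as F

    def contrastive_loss(text_emb: torch.Tensor,
                         image_emb: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
        # L2-normalize so dot products are cosine similarities.
        text_emb = F.normalize(text_emb, dim=-1)
        image_emb = F.normalize(image_emb, dim=-1)
        logits = text_emb @ image_emb.T / temperature  # (B, B) similarities
        targets = torch.arange(logits.size(0), device=logits.device)
        # Matched (text_i, image_i) pairs sit on the diagonal.
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.T, targets)) / 2

    text = torch.randn(8, 512)   # batch of text embeddings
    image = torch.randn(8, 512)  # batch of document-image embeddings
    print(contrastive_loss(text, image).item())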


What Shape Is Optimal for Masks in Text Removal?

Nakada, Hyakka, Kubota, Marika

arXiv.org Artificial Intelligence

The advent of generative models has dramatically improved the accuracy of image inpainting. In particular, removing specific text from document images to reconstruct the original image is extremely important for industrial applications. However, most existing methods of text removal focus on deleting simple scene text that appears in images captured by a camera in an outdoor environment. There is little research dedicated to complex and practical images with dense text. Therefore, we created benchmark data for text removal from images containing a large amount of text. From the data, we found that text-removal performance is vulnerable to perturbations of the mask profile. Thus, for practical text-removal tasks, precise tuning of the mask shape is essential. This study developed a method to model highly flexible mask profiles and learn their parameters using Bayesian optimization. The resulting profiles were found to be character-wise masks. It was also found that the minimum cover of a text region is not optimal. Our research is expected to pave the way for a user-friendly guideline for manual masking.
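
Neither the mask parameterization nor the optimizer configuration is given in the abstract; the sketch below illustrates the overall setup with character-wise rectangular masks controlled by a single dilation parameter, a placeholder quality score, and a plain random search standing in for Bayesian optimization. The character boxes and scoring function are invented for illustration.

    # Sketch of tuning a character-wise mask profile. A real setup would
    # score masks by inpainting quality and use Bayesian optimization;
    # here a dummy score and a random search stand in for both.
    import numpy as np

    def char_masks(boxes, dilation, shape):
        """One dilated rectangle per character box (x0, y0, x1, y1)."""
        mask = np.zeros(shape, dtype=bool)
        d = int(round(dilation))
        for x0, y0, x1, y1 in boxes:
            mask[max(0, y0 - d):y1 + d, max(0, x0 - d):x1 + d] = True
        return mask

    def score(mask):
        # Placeholder: penalize masking too much area. In practice this
        # would run the inpainter and measure reconstruction quality.
        return -mask.mean()

    boxes = [(10, 10, 30, 40), (35, 10, 55, 40)]  # fake character boxes
    rng = np.random.default_rng(0)
    best = max((score(char_masks(boxes, d, (64, 96))), d)
               for d in rng.uniform(0, 8, size=30))
    print(f"best dilation: {best[1]:.2f}")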


DocPTBench: Benchmarking End-to-End Photographed Document Parsing and Translation

Du, Yongkun, Chen, Pinxuan, Ying, Xuye, Chen, Zhineng

arXiv.org Artificial Intelligence

The advent of Multimodal Large Language Models (MLLMs) has unlocked the potential for end-to-end document parsing and translation. However, prevailing benchmarks such as OmniDocBench and DITrans are dominated by pristine scanned or digital-born documents, and thus fail to adequately represent the intricate challenges of real-world capture conditions, such as geometric distortions and photometric variations. To fill this gap, we introduce DocPTBench, a comprehensive benchmark specifically designed for Photographed Document Parsing and Translation. DocPTBench comprises over 1,300 high-resolution photographed documents from multiple domains, includes eight translation scenarios, and provides meticulously human-verified annotations for both parsing and translation. Our experiments demonstrate that transitioning from digital-born to photographed documents results in a substantial performance decline: popular MLLMs exhibit an average accuracy drop of 18% in end-to-end parsing and 12% in translation, while specialized document parsing models show a significant average decrease of 25%. This substantial performance gap underscores the unique challenges posed by documents captured in real-world conditions and reveals the limited robustness of existing models. Dataset and code are available at https://github.
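
For readers unfamiliar with how such declines are reported, the relative drop from a digital-born score to a photographed-document score can be computed as below; the numbers are invented for illustration and are not DocPTBench results.

    # Illustrative relative-drop computation (invented numbers, not
    # DocPTBench results): how a percentage decline is typically derived.
    def relative_drop(clean_score: float, photographed_score: float) -> float:
        return (clean_score - photographed_score) / clean_score * 100

    print(f"{relative_drop(0.80, 0.656):.1f}% drop")  # 18.0% drop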



Evaluating Multimodal Large Language Models on Vertically Written Japanese Text

Sasagawa, Keito, Kurita, Shuhei, Kawahara, Daisuke

arXiv.org Artificial Intelligence

Multimodal Large Language Models (MLLMs) have seen rapid advances in recent years and are now being applied to visual document understanding tasks. They are expected to process a wide range of document images across languages, including Japanese. Understanding documents from images requires models to read what is written in them. Since some Japanese documents are written vertically, support for vertical writing is essential. However, research specifically focused on vertically written Japanese text remains limited. In this study, we evaluate the reading capability of existing MLLMs on vertically written Japanese text. First, we generate a synthetic Japanese OCR dataset by rendering Japanese texts into images, and use it for both model fine-tuning and evaluation. This dataset includes Japanese text in both horizontal and vertical writing. We also create an evaluation dataset sourced from real-world document images containing vertically written Japanese text. Using these datasets, we demonstrate that existing MLLMs perform worse on vertically written Japanese text than on horizontally written Japanese text. Furthermore, we show that training MLLMs on our synthesized Japanese OCR dataset improves the performance of models that previously could not handle vertical writing. The datasets and code are publicly available at https://github.com/llm-jp/eval_vertical_ja.
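
The rendering pipeline is only described as "rendering Japanese texts into images"; one minimal way to synthesize vertically written text is to stack characters top-to-bottom, as in the Pillow sketch below. The font path is a placeholder, and the layout ignores refinements of real vertical typesetting such as rotated punctuation glyphs.

    # Minimal sketch of rendering vertical Japanese text with Pillow.
    # Font path is a placeholder; real vertical typesetting also rotates
    # some glyphs (e.g., long vowel marks), which this ignores.
    from PIL import Image, ImageDraw, ImageFont

    def render_vertical(text: str, font_path: str, size: int = 32) -> Image.Image:
        font = ImageFont.truetype(font_path, size)
        img = Image.new("RGB", (size * 2, size * (len(text) + 1)), "white")
        draw = ImageDraw.Draw(img)
        for i, ch in enumerate(text):  # stack characters top-to-bottom
            draw.text((size // 2, size * i), ch, font=font, fill="black")
        return img

    img = render_vertical("縦書きの例", "/path/to/NotoSansJP.ttf")  # placeholder path
    img.save("vertical_sample.png")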


Seeing Straight: Document Orientation Detection for Efficient OCR

Goswami, Suranjan, Ravi, Abhinav, Kolla, Raja, Faraz, Ali, Khan, Shaharukh, Akash, Khatri, Chandra, Agarwal, Shubham

arXiv.org Artificial Intelligence

Despite significant advances in document understanding, determining the correct orientation of scanned or photographed documents remains a critical pre-processing step in real-world settings. Accurate rotation correction is essential for enhancing the performance of downstream tasks such as Optical Character Recognition (OCR), where misalignment commonly arises due to user errors, particularly incorrect base orientations of the camera during capture. In this study, we first introduce OCR-Rotation-Bench (ORB), a new benchmark for evaluating OCR robustness to image rotations, comprising (i) ORB-En, built from rotation-transformed structured and free-form English OCR datasets, and (ii) ORB-Indic, a novel multilingual set spanning 11 Indic mid- to low-resource languages. We also present a fast, robust, and lightweight rotation classification pipeline built on the vision encoder of the Phi-3.5-Vision model with dynamic image cropping, fine-tuned specifically for the 4-class rotation task in a standalone fashion. Our method achieves near-perfect rotation-identification accuracy of 96% and 92% on the two datasets, respectively. Beyond classification, we demonstrate the critical role of our module in boosting OCR performance for closed-source models (by up to 14%) and open-weights models (by up to 4x) in a simulated real-world setting.
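
The classifier itself is a fine-tuned vision encoder and is not reproduced here; the sketch below instead illustrates the surrounding 4-class setup: generating rotated variants of a page for training or evaluation, and undoing a predicted rotation before OCR. The trained model is abstracted away, and all names are our own.

    # Sketch of the 4-class rotation setup around OCR: create rotated
    # variants for training/eval, and undo a predicted rotation before
    # passing the page to OCR. The rotation predictor is a placeholder.
    from PIL import Image

    ANGLES = [0, 90, 180, 270]  # the four orientation classes

    def rotated_variants(img: Image.Image):
        """Yield (class_id, rotated_image) pairs for training or evaluation."""
        for cls, angle in enumerate(ANGLES):
            yield cls, img.rotate(angle, expand=True)

    def correct_orientation(img: Image.Image, predicted_cls: int) -> Image.Image:
        # Rotate back by the inverse angle so text reads upright.
        return img.rotate(-ANGLES[predicted_cls], expand=True)

    page = Image.new("RGB", (600, 800), "white")  # stand-in document image
    for cls, variant in rotated_variants(page):
        fixed = correct_orientation(variant, cls)
        assert fixed.size == page.size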



GL-PGENet: A Parameterized Generation Framework for Robust Document Image Enhancement

Tang, Zhihong

arXiv.org Artificial Intelligence

Document Image Enhancement (DIE) serves as a critical component in Document AI systems, where its performance substantially determines the effectiveness of downstream tasks. To address the limitations of existing methods confined to single-degradation restoration or grayscale image processing, we present the Global with Local Parametric Generation Enhancement Network (GL-PGENet), a novel architecture designed for multi-degraded color document images, ensuring both efficiency and robustness in real-world scenarios. Our solution incorporates three key innovations: First, a hierarchical enhancement framework that integrates global appearance correction with local refinement, enabling coarse-to-fine quality improvement. Second, a Dual-Branch Local-Refine Network with parametric generation mechanisms that replaces conventional direct prediction, producing enhanced outputs through learned intermediate parametric representations rather than pixel-wise mapping. This approach enhances local consistency while improving model generalization. Finally, a modified NestUNet architecture incorporating dense blocks to effectively fuse low-level pixel features and high-level semantic features, specifically adapted for document image characteristics. In addition, to enhance generalization performance, we adopt a two-stage training strategy: large-scale pretraining on a synthetic dataset of 500,000+ samples, followed by task-specific fine-tuning. Extensive experiments demonstrate the superiority of GL-PGENet, achieving state-of-the-art SSIM scores of 0.7721 on DocUNet and 0.9480 on RealDAE. The model also exhibits remarkable cross-domain adaptability and maintains computational efficiency for high-resolution images without performance degradation, confirming its practical utility in real-world scenarios.
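
The abstract describes producing outputs "through learned intermediate parametric representations rather than pixel-wise mapping"; one common instantiation of that idea, sketched below, predicts coarse per-region gain and bias maps and applies them to the input image. This is our illustration of the general mechanism, not the GL-PGENet architecture.

    # Sketch of parametric generation for enhancement: predict coarse
    # per-region gain/bias maps instead of output pixels directly, then
    # upsample and apply them. Illustrative, not GL-PGENet itself.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ParametricEnhancer(nn.Module):  # hypothetical name
        def __init__(self, channels: int = 3, grid: int = 16):
            super().__init__()
            self.grid = grid
            # Tiny head predicting a (gain, bias) pair per channel per cell.
            self.head = nn.Sequential(
                nn.Conv2d(channels, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, channels * 2, 3, padding=1))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            coarse = F.adaptive_avg_pool2d(x, self.grid)  # per-region stats
            params = self.head(coarse)                    # (B, 2C, g, g)
            gain, bias = params.chunk(2, dim=1)
            gain = F.interpolate(gain, size=x.shape[-2:], mode="bilinear",
                                 align_corners=False)
            bias = F.interpolate(bias, size=x.shape[-2:], mode="bilinear",
                                 align_corners=False)
            return torch.clamp(x * (1 + gain) + bias, 0.0, 1.0)

    x = torch.rand(1, 3, 256, 256)
    print(ParametricEnhancer()(x).shape)  # torch.Size([1, 3, 256, 256])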